Quit child processes when supervisor crashes (Linux only) #199

srwilson · 2013-01-08T04:03:05Z

If supervisor crashes, its child processes are left running and are reparented to init. I've added a way to send these processes a stop signal if the parent goes away, using prctl(PR_SET_PDEATHSIG, SIGTERM). It only works on Linux. On other platforms, exceptions are caught and ignored.
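For readers unfamiliar with the call, here is a minimal sketch of how prctl(PR_SET_PDEATHSIG, ...) can be reached from Python through ctypes. It is not the code from this patch: the function name `set_parent_death_signal` is hypothetical, and the sketch assumes it runs in the child process after fork and before exec.

```python
import ctypes
import signal
import sys

# Option number for prctl(2), taken from <linux/prctl.h>.
PR_SET_PDEATHSIG = 1


def set_parent_death_signal(signum=signal.SIGTERM):
    """Ask the kernel to deliver `signum` to this process when its parent dies.

    Hypothetical helper, for illustration only. Returns True on success and
    False when the platform is not Linux or the call fails. Passing 0 clears
    any previously configured parent-death signal.
    """
    if not sys.platform.startswith("linux"):
        return False
    try:
        # CDLL(None) exposes the symbols already loaded into this process,
        # which includes libc's prctl on Linux.
        libc = ctypes.CDLL(None, use_errno=True)
        # prctl is variadic; the unused arguments are passed as 0.
        return libc.prctl(PR_SET_PDEATHSIG, int(signum), 0, 0, 0) == 0
    except (OSError, AttributeError):
        return False
```

The setting applies to the calling process only and is made by the child itself, which is why it has to be set between fork and exec.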

I've not run it on anything besides Linux yet.

To test it, start supervisor then send it a kill -9. The child process receives the SIGTERM as expected.

Use prctl(PR_SET_PDEATHSIG, SIGTERM) to send signal when parent dies

btubbs · 2013-01-11T01:37:39Z

+1!

swistakm · 2013-01-18T12:59:08Z

This fix would be very useful.

dexterbt1 · 2013-01-23T16:57:15Z

Seen this happen once in our staging setup. This is worth the fix.

lukeweber · 2015-08-18T23:34:37Z

+1. I thought that only running the code when you're on Linux might be better than just catching the exception, so I did that and also added a constant. Happy for you to merge the change into this PR.

https://github.com/lukeweber/supervisor/tree/zombie-process-fix

mnaberez · 2015-08-19T06:26:02Z

#649 is an alternative version of this pull request that adds a check.

lukeweber · 2015-08-19T17:34:24Z

> #649 is an alternative version of this pull request that adds a check.

And I added a constant PR_SET_PDEATHSIG with a comment about where it came from in case anyone's unsure about what 1 does. Also no conflicts to merge. This pull has been open for a long time.

mnaberez · 2015-08-19T20:30:22Z

Thanks for the pull request @srwilson and @lukeweber.

First, if you guys have seen supervisord crash, do you know how it crashed and has that been fixed? Certainly it has more bugs and will crash, but let's try to fix whatever condition you saw that prompted this.

About PR_SET_PDEATHSIG:

If supervisord crashes, its subprocesses are reparented, and they should keep running. Wouldn't that actually be desirable in some cases? For example, if you are running a database or cache under it, that may prevent data loss. If supervisord crashes, and PR_SET_PDEATHSIG causes those processes to just get SIGKILLed, they're gone and have no opportunity for clean shutdown. If supervisord crashes today, it probably means a sysadmin has to intervene anyway, and at least now that person can make the call how to deal with it. Does this sound reasonable?

There's no other case in supervisord where we have to dig around with ctypes or rely on an OS-specific (Linux) feature. I'm not too thrilled about doing that, but if you guys think it will really help your deployment, I'm open to it. How about a setting under a [program:x] section where a PR_SET_PDEATHSIG signal can be specified (default None)? That would preserve the existing conservative behavior, but make this available.

I do agree with a change in @lukeweber's PR, if we know this is a Linux-only thing, we should only attempt it on Linux. Also, most config settings will raise ValueError which halts startup if an invalid option is specified. If we added this option, but the OS didn't support it or the call failed, I think it should raise to abort supervisord startup, forcing the user to notice and correct the config file.
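A rough sketch of the validation described above, assuming the option value arrives as a string from the config file. The function name `parse_pdeathsig` and its `platform` parameter are hypothetical and are not supervisord's actual API; the option name prsetpdeathsig is the one used later in this thread.

```python
import signal
import sys


def parse_pdeathsig(value, platform=sys.platform):
    """Turn a config value such as "TERM" or "SIGKILL" into a signal number.

    Hypothetical helper, for illustration only. Returns None when the option
    is unset, which keeps the existing behavior. Raises ValueError when the
    platform is not Linux or the signal name is unknown, so that startup
    would abort instead of silently ignoring the option.
    """
    if value is None or value.strip().lower() in ("", "none"):
        return None
    if not platform.startswith("linux"):
        raise ValueError(
            "prsetpdeathsig is only supported on Linux, not on %r" % platform
        )
    name = value.strip().upper()
    if not name.startswith("SIG"):
        name = "SIG" + name
    signum = getattr(signal, name, None)
    # Reject names such as SIG_IGN, which are handlers rather than signals.
    if not isinstance(signum, signal.Signals):
        raise ValueError("unknown signal name: %r" % value)
    return int(signum)
```

Leaving the default as None means nothing changes for existing config files, while a non-None value on an unsupported platform fails loudly.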

btubbs · 2015-08-19T21:36:15Z

I run supervisor as a component in a PaaS (https://bitbucket.org/yougov/velociraptor/) where application instances are typically redundant and load balanced anyway. We can easily tolerate Supervisor dying and taking all its processes with it, but if it dies and leaves zombies behind, that's quite a mess to clean up.

+1 to the PR.

A Linux-specific config would be fine.

lukeweber · 2015-08-20T00:16:28Z

I don't think I had a specific crash but was playing with things when I was figuring out my code deployment.

Was using unicornherder in supervisor 3.0 and this is what I ran into.

Kill supervisord. unicornherder and gunicorn workers are running.
Start supervisord. Now I notice that I have two unicornherder processes and the same worker children (4 total with the same pids) as before.
Supervisorctl reports the new pid for the unicornherder, but it seems to communicate with the old workers, and so does the orphaned unicornherder pid.

So I've verified that, apart from the extra unicornherder processes, this isn't really a problem, I think, because either process does communicate with the workers. Initially, though, I was concerned that I might be sending my signal to the wrong pid, because I rely on this to update code (supervisorctl -c $SUPERVISOR_CONF pid $APP_NAME | xargs kill -s HUP), and if it wasn't talking to the right workers it would silently fail.

@mnaberez - I'm not an expert on the process logic, but could you confirm: Even though things are reparented, they aren't managed as the old running process and a new process is launched when supervisord starts?

I think it's a question of whether you want the possibilities of zombies or multiple processes running, which is sometimes even worse than nothing, but agree it's probably best as a judgement call leaving the default as is. I'm thinking of things that might not be safe to run two of here or might be strictly undesirable.

mnaberez · 2015-08-20T00:45:36Z

> @mnaberez - I'm not an expert on the process logic, but could you confirm: Even though things are reparented, they aren't managed as the old running process and a new process is launched when supervisord starts?

If supervisord crashes because of a bug, or if supervisord is kill -9ed, then the supervisord process is gone but the subprocesses it spawned are still running. They no longer have supervisord as their parent so they are reparented (adopted by the init system). A new instance of supervisord doesn't have any knowledge or control over those processes. A supervisord instance only knows about the children it spawns itself.

Not quite related to this ticket, however: an orphaned process situation like you described can happen even if supervisord doesn't crash. Some programs like gunicorn spawn their own worker processes. supervisord doesn't know about those processes. supervisord only knows about the "master" gunicorn process that it started, not any processes that gunicorn spawned. In this case, you should be sure that you set killasgroup=true in the [program:gunicorn] config. If gunicorn doesn't shut down quickly enough (as defined by stopwaitsecs=) then supervisord will send SIGKILL to it. If gunicorn is killed with SIGKILL, it won't be able to shut down its workers, so they may stay around. Using killasgroup=true tells supervisord to signal the process group, so the gunicorn master and its workers all receive the kill signal.
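To illustrate the idea of signalling a whole process group, here is a minimal POSIX-only sketch. It is not supervisord's implementation; the helper names `start_in_own_group` and `signal_group` are hypothetical, and the sketch assumes the workers stay in the process group of the process that spawned them.

```python
import os
import signal
import subprocess


def start_in_own_group(argv):
    """Start a program as the leader of a new session and process group.

    Hypothetical helper, for illustration only. start_new_session=True makes
    the child call setsid() before exec, so the child and any workers it
    forks later share a process group whose id equals the child's pid.
    """
    return subprocess.Popen(argv, start_new_session=True)


def signal_group(proc, sig=signal.SIGKILL):
    """Send `sig` to every process in the group that `proc` belongs to."""
    os.killpg(os.getpgid(proc.pid), sig)
```

For example, after `proc = start_in_own_group(["gunicorn", "app:app"])`, calling `signal_group(proc)` would deliver SIGKILL to the master and to any workers still in its group, rather than to the master alone. The gunicorn command line here is only a placeholder.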

> I think it's a question of whether you want the possibilities of zombies or multiple processes running, which is sometimes even worse than nothing, but agree it's probably best as a judgement call leaving the default as is. I'm thinking of things that might not be safe to run two of here or might be strictly undesirable.

Yeah, if we add support for PR_SET_PDEATHSIG, I'm thinking it's probably safer to let the user decide in the config file if it should be used and how. I have reservations about this PR just setting it to SIGKILL all the time for the reasons I mentioned above.

lukeweber · 2015-08-26T05:39:16Z

I've updated my pull request to make this a config option that can only be set on Linux, and I added tests for the config.

* Original pull inspired by: Supervisor#199
* If the supervisor process dies, the child process will receive the signal set in prsetpdeathsig.

mnaberez · 2015-11-01T20:25:31Z

#649 is an updated version of this PR that includes changes from the discussion above, and still includes the two commits from this one. I am going to close this PR in favor of the new one. I've added a note there so we don't forget to credit @srwilson (the author of this patch) in the changelog.

Quit child processes when supervisor crashes (Linux only)

90acef9

Use prctl(PR_SET_PDEATHSIG, SIGTERM) to send signal when parent dies

ENG-8730 PDEATHSIG is SIGKILL instead of SIGTERM

d32be88

lukeweber mentioned this pull request Aug 19, 2015

Quit child processes when supervisor crashes (Linux only) #649

Open

lukeweber added a commit to lukeweber/supervisor that referenced this pull request Sep 30, 2015

Document updates for Supervisor#199

9729c0b

lukeweber added a commit to lukeweber/supervisor that referenced this pull request Sep 30, 2015

Document updates for Supervisor#199

c91d06d

mnaberez closed this Nov 1, 2015

This was referenced Apr 19, 2022

Feature improve shutdown behavior. Closes https://github.com/Supervisor/supervisor/issues/1101 michaelhammann/supervisor#1

Closed

Scrape ps output to try and kill any grandchild processes when stopping a process #1502

Open
